add E2E testing framework by janisz · Pull Request #26 · stackrox/stackrox-mcp

janisz · 2026-01-15T16:51:40Z

Description

Enhanced tool descriptions and parameter schemas to better guide LLMs on when to use optional parameters and which tools to select for different query types. Added mcp-testing-framework configuration with 8 test cases covering CVE queries and cluster operations, achieving 87.5% pass rate with GPT-5 models.

Validation

./scripts/run-tests.sh
══════════════════════════════════════════════════════════
  StackRox MCP E2E Testing with Gevals
══════════════════════════════════════════════════════════

Loading environment variables from .env...
Configuration:
  Agent Model: gpt-4o
  Judge Model: gpt-4o
  MCP Server: stackrox-mcp (via go run)

Running gevals tests...


=== Starting Evaluation ===

Task: list-clusters
  Difficulty: easy
  → Running agent...
  → Verifying results...
  ✓ Task passed

Task: cve-affecting-workloads
  Difficulty: easy
  → Running agent...
  → Verifying results...
  ✓ Task passed

Task: cve-affecting-clusters
  Difficulty: easy
  → Running agent...
  → Verifying results...
  ✓ Task passed

Task: cve-nonexistent
  Difficulty: easy
  → Running agent...
  → Verifying results...
  ✓ Task passed

Task: cve-cluster-scooby
  Difficulty: easy
  → Running agent...
  → Verifying results...
  ✓ Task passed

Task: cve-cluster-maria
  Difficulty: easy
  → Running agent...
  → Verifying results...
  ✓ Task passed

Task: cve-clusters-general
  Difficulty: easy
  → Running agent...
  → Verifying results...
  ✓ Task passed

Task: cve-cluster-list
  Difficulty: easy
  → Running agent...
  → Verifying results...
  ✓ Task passed

=== Evaluation Complete ===

📄 Results saved to: gevals-stackrox-mcp-e2e-out.json

=== Results Summary ===

Task: list-clusters
  Path: /home/janisz/go/src/github.com/stackrox/stackrox-mcp/e2e-tests/gevals/tasks/list-clusters.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (3/3)

Task: cve-affecting-workloads
  Path: /home/janisz/go/src/github.com/stackrox/stackrox-mcp/e2e-tests/gevals/tasks/cve-affecting-workloads.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (3/3)

Task: cve-affecting-clusters
  Path: /home/janisz/go/src/github.com/stackrox/stackrox-mcp/e2e-tests/gevals/tasks/cve-affecting-clusters.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (3/3)

Task: cve-nonexistent
  Path: /home/janisz/go/src/github.com/stackrox/stackrox-mcp/e2e-tests/gevals/tasks/cve-nonexistent.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (3/3)

Task: cve-cluster-scooby
  Path: /home/janisz/go/src/github.com/stackrox/stackrox-mcp/e2e-tests/gevals/tasks/cve-cluster-scooby.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (3/3)

Task: cve-cluster-maria
  Path: /home/janisz/go/src/github.com/stackrox/stackrox-mcp/e2e-tests/gevals/tasks/cve-cluster-maria.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (3/3)

Task: cve-clusters-general
  Path: /home/janisz/go/src/github.com/stackrox/stackrox-mcp/e2e-tests/gevals/tasks/cve-clusters-general.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (3/3)

Task: cve-cluster-list
  Path: /home/janisz/go/src/github.com/stackrox/stackrox-mcp/e2e-tests/gevals/tasks/cve-cluster-list.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (3/3)

=== Overall Statistics ===
Total Tasks: 8
Tasks Passed: 8/8
Assertions Passed: 24/24

=== Statistics by Difficulty ===

easy:
  Tasks: 8/8
  Assertions: 24/24

══════════════════════════════════════════════════════════
  Tests Completed Successfully!
══════════════════════════════════════════════════════════
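
The summary figures above are simple tallies, and the 87.5% in the description matches 7 of 8 tasks passing. As a rough sketch of that arithmetic, the helper below counts tasks and assertions from lines shaped like `name STATUS passed/total`. This input format and the function name `summarize_results` are assumptions for illustration. They are not the schema of `gevals-stackrox-mcp-e2e-out.json`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: tally results given as "name STATUS passed/total" lines.
# The line format is an illustration only, not the real gevals output schema.
summarize_results() {
  awk '
    {
      total++
      if ($2 == "PASSED") passed++
      split($3, parts, "/")          # "3/3" -> parts[1]=3, parts[2]=3
      assertions_passed += parts[1]
      assertions_total  += parts[2]
    }
    END {
      printf "Tasks Passed: %d/%d\n", passed, total
      printf "Assertions Passed: %d/%d\n", assertions_passed, assertions_total
      if (total > 0) printf "Pass Rate: %.1f%%\n", 100 * passed / total
    }'
}

# Demo with two made-up result lines.
printf '%s\n' \
  "list-clusters PASSED 3/3" \
  "cve-nonexistent FAILED 2/3" | summarize_results
```

Feeding it eight lines with one `FAILED` entry yields a pass rate of 87.5%.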

codecov-commenter · 2026-01-15T16:56:08Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 77.36%. Comparing base (bc05b10) to head (a29703e).
⚠️ Report is 2 commits behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files

@@           Coverage Diff           @@
##             main      #26   +/-   ##
=======================================
  Coverage   77.36%   77.36%           
=======================================
  Files          26       26           
  Lines        1162     1162           
=======================================
  Hits          899      899           
  Misses        223      223           
  Partials       40       40

e2e-tests/gevals/tasks/cve-affecting-clusters.yaml

internal/toolsets/config/tools.go

internal/toolsets/vulnerability/clusters.go

e2e-tests/mcp-testing-framework.yaml

e2e-tests/gevals/tasks/cve-nonexistent.yaml

e2e-tests/gevals/tasks/list-clusters.yaml

e2e-tests/mcpchecker/eval.yaml

Enhanced tool descriptions and parameter schemas to better guide LLMs on when to use optional parameters and which tools to select for different query types. Added mcp-testing-framework configuration with 8 test cases covering CVE queries and cluster operations, achieving 87.5% pass rate with GPT-5 models.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Tomasz Janiszewski <tomek@redhat.com>

# Conflicts:
#   internal/toolsets/config/tools.go

Signed-off-by: Tomasz Janiszewski <tomek@redhat.com>

Fix E2E test assertion failures by improving tool descriptions with smart usage pattern guidance.

Tool descriptions now clearly indicate:
- When to call all three CVE tools for comprehensive coverage ("Is CVE-X detected in my clusters?" without specific cluster name)
- When to call only specific tools for targeted queries ("Is CVE-X detected in cluster staging-central-cluster?")

Changes:
- Update vulnerability tool descriptions (clusters, deployments, nodes) to use directive language and clear usage patterns
- Adjust cve-nonexistent test maxToolCalls from 2 to 3 to match comprehensive check pattern
- Update cve-cluster-does-not-exist verification to accept both "CVE not detected" and "cluster doesn't exist" responses

Results: All 24/24 E2E test assertions now pass (improved from 21/24).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
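
The `maxToolCalls` change is easier to follow with the numbers written out: a comprehensive check calls all three CVE tools, so a budget of 2 cannot accommodate it. The sketch below shows only that comparison. The function name `within_tool_call_budget` is hypothetical, and how the evaluation framework enforces the limit internally is not shown.

```shell
#!/usr/bin/env bash
# Hypothetical check: does an observed number of tool calls fit a task's budget?
# "max_tool_calls" mirrors the maxToolCalls setting named in the commit message.
within_tool_call_budget() {
  local calls="$1" max_tool_calls="$2"
  [ "$calls" -le "$max_tool_calls" ]
}

# A comprehensive check calls each of the three CVE tools once.
comprehensive_calls=3

for budget in 2 3; do
  if within_tool_call_budget "$comprehensive_calls" "$budget"; then
    echo "maxToolCalls=$budget: comprehensive check fits"
  else
    echo "maxToolCalls=$budget: comprehensive check exceeds the budget"
  fi
done
```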

…criptions

Changes:
- Switch E2E agent from GPT-4o to Claude Sonnet 4.5 via Vertex AI
- Add enableAllTools: true to MCP config for auto-approval
- Configure gpt-5-nano as LLM judge for cost efficiency
- Improve CVE tool descriptions with clear WHEN TO USE/WHEN NOT TO USE sections
- Update test assertions to account for Claude's comprehensive CVE checking behavior
- Update run-tests.sh to export Vertex AI environment variables

The tool descriptions now explicitly guide when to use each CVE detection tool:
- General "clusters" queries → comprehensive check (all 3 tools)
- Specific component queries → single relevant tool only
- Single cluster queries → orchestrator tool with cluster filter

All 8 E2E tests passing with 24/24 assertions.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
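
The three routing rules listed in this commit message can be read as a small lookup table. The sketch below is one way to write it down. The query-kind names (`general`, `component`, `single-cluster`) and the tool labels it prints are placeholders chosen for illustration, not the real MCP tool names, and the actual selection is made by the LLM from the tool descriptions rather than by code like this.

```shell
#!/usr/bin/env bash
# Hypothetical routing table for the three query kinds in the commit message.
# Kind names and tool labels are placeholders, not real MCP tool names.
tools_for_query() {
  local kind="$1" target="${2:-}"
  case "$kind" in
    general)        echo "clusters deployments nodes" ;;          # comprehensive check: all 3 tools
    component)      echo "$target" ;;                             # single relevant tool only
    single-cluster) echo "orchestrator-tool cluster=$target" ;;   # one tool plus a cluster filter
    *)              echo "unknown query kind: $kind" >&2; return 1 ;;
  esac
}

tools_for_query general
tools_for_query component deployments
tools_for_query single-cluster example-cluster-id
```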

Signed-off-by: Tomasz Janiszewski <tomek@redhat.com>

- Update README.md with complete env var configuration
- Fix jq command examples (path and property names)
- Add AGENT_MODEL_NAME configuration to run-tests.sh
- Clarify cluster ID-only requirement in tool descriptions
- Add explanatory comments to eval.yaml about assertion fields
- Improve list-clusters verification text
- Remove leftover mcp-testing-framework.yaml file

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

mtodor

Looks good! I have added a few questions and thoughts. Nothing crucial.

I didn't review the tasks because we will replace them in a follow-up.

e2e-tests/mcpchecker/eval.yaml

e2e-tests/scripts/run-tests.sh

e2e-tests/scripts/build-gevals.sh

e2e-tests/scripts/run-tests.sh

e2e-tests/README.md

Co-authored-by: Mladen Todorovic <mtodor@gmail.com>

e2e-tests/scripts/run-tests.sh

Signed-off-by: Tomasz Janiszewski <tomek@redhat.com>

- Upgrade from gevals v0.0.1 to mcpchecker v0.0.4
- Move e2e-tests Go module to tools/ subdirectory to fix module resolution issue when running MCP server from mcpchecker directory
- Rename gevals/ directory to mcpchecker/
- Update build script: build-gevals.sh → build-mcpchecker.sh
- Update all references in documentation and scripts
- Fix jq commands in README for new mcpchecker JSON structure
- Remove gevals dependency from root go.mod
- Add Dependabot configuration to monitor both root and e2e-tests/tools modules

All tests passing (8/8 tasks, 24/24 assertions).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Add smoke test script that validates e2e test configuration without requiring actual agents or API keys. This allows CI to catch configuration errors early.

Changes:
- Add e2e-tests/scripts/smoke-test.sh to validate:
  - mcpchecker binary builds
  - MCP server compiles
  - YAML configuration files are valid
  - Task files exist and are parseable
- Add .github/workflows/e2e-smoke-test.yml for CI integration
- Update README with smoke test section

The smoke test runs in <30s and requires no secrets, making it ideal for PR validation.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
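
To make the "task files exist" step from this commit concrete, here is a minimal sketch of a secret-free pre-flight check. It is not the contents of `smoke-test.sh`. The function name `check_task_files` is hypothetical, and the check only confirms that non-empty `.yaml` files are present. Real YAML parsing would need a dedicated tool and is left out.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check: a task directory must contain at least one
# non-empty .yaml file. This approximates "exist and are parseable" without
# actually parsing YAML.
check_task_files() {
  local dir="$1" file found=0
  [ -d "$dir" ] || { echo "missing task directory: $dir" >&2; return 1; }
  for file in "$dir"/*.yaml; do
    [ -e "$file" ] || continue    # the glob matched nothing
    [ -s "$file" ] || { echo "empty task file: $file" >&2; return 1; }
    found=$((found + 1))
  done
  [ "$found" -gt 0 ] || { echo "no task files in $dir" >&2; return 1; }
  echo "ok: $found task file(s) in $dir"
}

# Demo against a throwaway directory so no repository checkout is required.
demo_dir="$(mktemp -d)"
printf 'placeholder: true\n' > "$demo_dir/list-clusters.yaml"
check_task_files "$demo_dir"
rm -rf "$demo_dir"
```

A check like this needs no API keys, which is the property the commit message highlights for PR validation.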

mtodor

Nice work! 🏆

Added a few nitpicks, nothing crucial or something that we can do in a followup.

e2e-tests/scripts/smoke-test.sh

e2e-tests/scripts/run-tests.sh

- Merge e2e-smoke-test.yml into test.yml to eliminate duplicate builds
- Simplify smoke-test.sh to only build and verify mcpchecker binary
- Remove MCP server build from smoke test (already built by test workflow)
- Remove YAML validation from smoke test (will use yamllint in separate PR)
- Add Makefile target for e2e-smoke-test
- Add go mod tidy verification using find for all Go modules
- Use find for dependency downloads to support multiple modules

This addresses PR review feedback and reduces CI build time by avoiding duplicate checkout and build operations.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
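
The `find`-based handling of multiple Go modules mentioned in this commit could look roughly like the sketch below. It is an illustration under stated assumptions, not the workflow's actual steps. The function names `list_go_modules` and `verify_modules_tidy` are hypothetical, and the tidy check assumes a git checkout with `go` and `git` on `PATH`.

```shell
#!/usr/bin/env bash
# List every directory under ROOT that contains a go.mod file.
list_go_modules() {
  local root="$1"
  find "$root" -name go.mod -type f -exec dirname {} \; | sort
}

# Run `go mod tidy` in each module, then fail if that changed any tracked file.
# Assumes ROOT is a git checkout and that `go` and `git` are installed.
verify_modules_tidy() {
  local root="$1"
  list_go_modules "$root" | while IFS= read -r dir; do
    (cd "$dir" && go mod tidy) || exit 1   # leaves the pipeline subshell on failure
  done || return 1
  git -C "$root" diff --exit-code
}

# Demo: discover modules in a throwaway tree shaped like "root + e2e-tests/tools".
demo_root="$(mktemp -d)"
mkdir -p "$demo_root/e2e-tests/tools"
touch "$demo_root/go.mod" "$demo_root/e2e-tests/tools/go.mod"
list_go_modules "$demo_root"
rm -rf "$demo_root"
```

Discovering modules with `find` means a newly added module is picked up without editing the workflow. The demo only lists modules, so it runs without a Go toolchain.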

janisz force-pushed the e2e-tests branch from 5b19c98 to 2868f53 January 15, 2026 16:58

janisz mentioned this pull request Jan 16, 2026

chore(tweak): Tweak tool name and description #25

Merged

1 task

mtodor reviewed Jan 19, 2026

janisz marked this pull request as draft January 19, 2026 17:39

janisz force-pushed the e2e-tests branch from bcaaa07 to 6e0ec3d January 19, 2026 18:40

janisz mentioned this pull request Jan 21, 2026

chore: improve tools description #29

Merged

janisz and others added 5 commits January 23, 2026 14:59

use gevals

2cafa48

Signed-off-by: Tomasz Janiszewski <tomek@redhat.com>

fix

03fe2e4

Signed-off-by: Tomasz Janiszewski <tomek@redhat.com>

janisz force-pushed the e2e-tests branch from b830585 to 03fe2e4 January 23, 2026 14:00

janisz changed the title ~~Improve LLM tool parameter guidance and add E2E testing framework~~ add E2E testing framework Jan 23, 2026

janisz requested a review from mtodor January 29, 2026 15:35

janisz marked this pull request as ready for review January 29, 2026 15:36

mtodor reviewed Feb 2, 2026

Update e2e-tests/scripts/build-gevals.sh

5c4ab07

Co-authored-by: Mladen Todorovic <mtodor@gmail.com>

janisz commented Feb 2, 2026

e2e-tests/scripts/run-tests.sh (outdated, resolved)

janisz and others added 5 commits February 2, 2026 17:51

Apply suggestion from @janisz

a9fffdb

Apply suggestion from @janisz

9c7a6e1

fix

6d9a913

Signed-off-by: Tomasz Janiszewski <tomek@redhat.com>

janisz requested a review from mtodor February 3, 2026 17:12

mtodor approved these changes Feb 3, 2026

e2e-tests/scripts/smoke-test.sh (outdated, resolved)

e2e-tests/scripts/smoke-test.sh (outdated, resolved)

e2e-tests/scripts/smoke-test.sh (outdated, resolved)

e2e-tests/scripts/run-tests.sh (outdated, resolved)

janisz merged commit cb19cfb into main Feb 5, 2026
4 checks passed

janisz deleted the e2e-tests branch February 5, 2026 15:31

3 participants
